• PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.

Steps and tasks:

1. Import and understand the data

In [119]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, average_precision_score
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings(action='ignore')

1.A. Import ‘signal-data.csv’ as DataFrame

In [2]:
df= pd.read_csv('signal-data.csv')
In [3]:
df.head()
Out[3]:
Time 0 1 2 3 4 5 6 7 8 ... 581 582 583 584 585 586 587 588 589 Pass/Fail
0 2008-07-19 11:55:00 3030.93 2564.00 2187.7333 1411.1265 1.3602 100.0 97.6133 0.1242 1.5005 ... NaN 0.5005 0.0118 0.0035 2.3630 NaN NaN NaN NaN -1
1 2008-07-19 12:32:00 3095.78 2465.14 2230.4222 1463.6606 0.8294 100.0 102.3433 0.1247 1.4966 ... 208.2045 0.5019 0.0223 0.0055 4.4447 0.0096 0.0201 0.0060 208.2045 -1
2 2008-07-19 13:17:00 2932.61 2559.94 2186.4111 1698.0172 1.5102 100.0 95.4878 0.1241 1.4436 ... 82.8602 0.4958 0.0157 0.0039 3.1745 0.0584 0.0484 0.0148 82.8602 1
3 2008-07-19 14:43:00 2988.72 2479.90 2199.0333 909.7926 1.3204 100.0 104.2367 0.1217 1.4882 ... 73.8432 0.4990 0.0103 0.0025 2.0544 0.0202 0.0149 0.0044 73.8432 -1
4 2008-07-19 15:22:00 3032.24 2502.87 2233.3667 1326.5200 1.5334 100.0 100.3967 0.1235 1.5031 ... NaN 0.4800 0.4766 0.1045 99.3032 0.0202 0.0149 0.0044 73.8432 -1

5 rows × 592 columns

1.B. Print 5 point summary and share at least 2 observations

In [4]:
df.describe().T
Out[4]:
count mean std min 25% 50% 75% max
0 1561.0 3014.452896 73.621787 2743.2400 2966.260000 3011.4900 3056.6500 3356.3500
1 1560.0 2495.850231 80.407705 2158.7500 2452.247500 2499.4050 2538.8225 2846.4400
2 1553.0 2200.547318 29.513152 2060.6600 2181.044400 2201.0667 2218.0555 2315.2667
3 1553.0 1396.376627 441.691640 0.0000 1081.875800 1285.2144 1591.2235 3715.0417
4 1553.0 4.197013 56.355540 0.6815 1.017700 1.3168 1.5257 1114.5366
... ... ... ... ... ... ... ... ...
586 1566.0 0.021458 0.012358 -0.0169 0.013425 0.0205 0.0276 0.1028
587 1566.0 0.016475 0.008808 0.0032 0.010600 0.0148 0.0203 0.0799
588 1566.0 0.005283 0.002867 0.0010 0.003300 0.0046 0.0064 0.0286
589 1566.0 99.670066 93.891919 0.0000 44.368600 71.9005 114.7497 737.3048
Pass/Fail 1567.0 -0.867262 0.498010 -1.0000 -1.000000 -1.0000 -1.0000 1.0000

591 rows × 8 columns

  • There is large variation in the mean values between features.
  • There are features (like '5') which have same entries in all the rows.
  • There are features (like '9') with low values.
  • The total number of entries vary across features indicating the presence of null/NaN values.

Since the 'Time' data is not useful in the context of the problem, it is dropped.

In [5]:
df = df.drop('Time',axis=1)

2. Data cleansing:

2.A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.

In [6]:
df.isnull().sum()
Out[6]:
0             6
1             7
2            14
3            14
4            14
             ..
586           1
587           1
588           1
589           1
Pass/Fail     0
Length: 591, dtype: int64
In [7]:
df.shape
Out[7]:
(1567, 591)
In [8]:
print('Number of rows   : ',df.shape[0])
print('Number of columns: ',df.shape[1])
Number of rows   :  1567
Number of columns:  591
In [9]:
df['Pass/Fail'].unique()
Out[9]:
array([-1,  1], dtype=int64)
In [10]:
df['Pass/Fail'] = df['Pass/Fail'].replace(to_replace=1,value=0)
df['Pass/Fail'] = df['Pass/Fail'].replace(to_replace=-1,value=1)
In [11]:
d = df.isnull().sum() * 100 / len(df)
j = []
for i in d.keys():
    if(d[i] >=0.20):
        print(i, d[i])
        j.append(i)
0 0.3828972559029994
1 0.4467134652201659
2 0.8934269304403318
3 0.8934269304403318
4 0.8934269304403318
5 0.8934269304403318
6 0.8934269304403318
7 0.574345883854499
19 0.6381620931716656
40 1.5315890236119976
41 1.5315890236119976
53 0.2552648372686662
54 0.2552648372686662
55 0.2552648372686662
56 0.2552648372686662
57 0.2552648372686662
58 0.2552648372686662
59 0.4467134652201659
60 0.3828972559029994
61 0.3828972559029994
62 0.3828972559029994
63 0.4467134652201659
64 0.4467134652201659
65 0.4467134652201659
66 0.3828972559029994
67 0.3828972559029994
68 0.3828972559029994
69 0.3828972559029994
70 0.3828972559029994
71 0.3828972559029994
72 50.67007019783025
73 50.67007019783025
74 0.3828972559029994
75 1.5315890236119976
76 1.5315890236119976
77 1.5315890236119976
78 1.5315890236119976
79 1.5315890236119976
80 1.5315890236119976
81 1.5315890236119976
82 1.5315890236119976
84 0.7657945118059988
85 85.57753669432036
89 3.2546266751754946
90 3.2546266751754946
91 0.3828972559029994
94 0.3828972559029994
95 0.3828972559029994
96 0.3828972559029994
97 0.3828972559029994
98 0.3828972559029994
99 0.3828972559029994
100 0.3828972559029994
101 0.3828972559029994
102 0.3828972559029994
105 0.3828972559029994
106 0.3828972559029994
107 0.3828972559029994
108 0.3828972559029994
109 64.96490108487556
110 64.96490108487556
111 64.96490108487556
112 45.62858966177409
118 1.5315890236119976
121 0.574345883854499
122 0.574345883854499
123 0.574345883854499
124 0.574345883854499
125 0.574345883854499
126 0.574345883854499
127 0.574345883854499
128 0.574345883854499
129 0.574345883854499
130 0.574345883854499
131 0.574345883854499
132 0.5105296745373324
133 0.5105296745373324
134 0.5105296745373324
135 0.3190810465858328
136 0.3828972559029994
137 0.4467134652201659
138 0.8934269304403318
139 0.8934269304403318
140 0.8934269304403318
141 0.8934269304403318
142 0.8934269304403318
143 0.574345883854499
155 0.6381620931716656
157 91.19336311423102
158 91.19336311423102
178 1.5315890236119976
190 0.2552648372686662
191 0.2552648372686662
192 0.2552648372686662
193 0.2552648372686662
194 0.2552648372686662
195 0.2552648372686662
196 0.4467134652201659
197 0.3828972559029994
198 0.3828972559029994
199 0.3828972559029994
200 0.4467134652201659
201 0.4467134652201659
202 0.4467134652201659
203 0.3828972559029994
204 0.3828972559029994
205 0.3828972559029994
206 0.3828972559029994
207 0.3828972559029994
208 0.3828972559029994
209 0.3828972559029994
210 1.5315890236119976
211 1.5315890236119976
212 1.5315890236119976
213 1.5315890236119976
214 1.5315890236119976
215 1.5315890236119976
216 1.5315890236119976
217 1.5315890236119976
219 0.7657945118059988
220 85.57753669432036
224 3.2546266751754946
225 3.2546266751754946
226 0.3828972559029994
229 0.3828972559029994
230 0.3828972559029994
231 0.3828972559029994
232 0.3828972559029994
233 0.3828972559029994
234 0.3828972559029994
235 0.3828972559029994
236 0.3828972559029994
237 0.3828972559029994
240 0.3828972559029994
241 0.3828972559029994
242 0.3828972559029994
243 0.3828972559029994
244 64.96490108487556
245 64.96490108487556
246 64.96490108487556
247 45.62858966177409
253 1.5315890236119976
256 0.574345883854499
257 0.574345883854499
258 0.574345883854499
259 0.574345883854499
260 0.574345883854499
261 0.574345883854499
262 0.574345883854499
263 0.574345883854499
264 0.574345883854499
265 0.574345883854499
266 0.574345883854499
267 0.5105296745373324
268 0.5105296745373324
269 0.5105296745373324
270 0.3190810465858328
271 0.3828972559029994
272 0.4467134652201659
273 0.8934269304403318
274 0.8934269304403318
275 0.8934269304403318
276 0.8934269304403318
277 0.8934269304403318
278 0.574345883854499
290 0.6381620931716656
292 91.19336311423102
293 91.19336311423102
313 1.5315890236119976
314 1.5315890236119976
326 0.2552648372686662
327 0.2552648372686662
328 0.2552648372686662
329 0.2552648372686662
330 0.2552648372686662
331 0.2552648372686662
332 0.4467134652201659
333 0.3828972559029994
334 0.3828972559029994
335 0.3828972559029994
336 0.4467134652201659
337 0.4467134652201659
338 0.4467134652201659
339 0.3828972559029994
340 0.3828972559029994
341 0.3828972559029994
342 0.3828972559029994
343 0.3828972559029994
344 0.3828972559029994
345 50.67007019783025
346 50.67007019783025
347 0.3828972559029994
348 1.5315890236119976
349 1.5315890236119976
350 1.5315890236119976
351 1.5315890236119976
352 1.5315890236119976
353 1.5315890236119976
354 1.5315890236119976
355 1.5315890236119976
357 0.7657945118059988
358 85.57753669432036
362 3.2546266751754946
363 3.2546266751754946
364 0.3828972559029994
367 0.3828972559029994
368 0.3828972559029994
369 0.3828972559029994
370 0.3828972559029994
371 0.3828972559029994
372 0.3828972559029994
373 0.3828972559029994
374 0.3828972559029994
375 0.3828972559029994
378 0.3828972559029994
379 0.3828972559029994
380 0.3828972559029994
381 0.3828972559029994
382 64.96490108487556
383 64.96490108487556
384 64.96490108487556
385 45.62858966177409
391 1.5315890236119976
394 0.574345883854499
395 0.574345883854499
396 0.574345883854499
397 0.574345883854499
398 0.574345883854499
399 0.574345883854499
400 0.574345883854499
401 0.574345883854499
402 0.574345883854499
403 0.574345883854499
404 0.574345883854499
405 0.5105296745373324
406 0.5105296745373324
407 0.5105296745373324
408 0.3190810465858328
409 0.3828972559029994
410 0.4467134652201659
411 0.8934269304403318
412 0.8934269304403318
413 0.8934269304403318
414 0.8934269304403318
415 0.8934269304403318
416 0.574345883854499
428 0.6381620931716656
449 1.5315890236119976
450 1.5315890236119976
462 0.2552648372686662
463 0.2552648372686662
464 0.2552648372686662
465 0.2552648372686662
466 0.2552648372686662
467 0.2552648372686662
468 0.4467134652201659
469 0.3828972559029994
470 0.3828972559029994
471 0.3828972559029994
472 0.4467134652201659
473 0.4467134652201659
474 0.4467134652201659
475 0.3828972559029994
476 0.3828972559029994
477 0.3828972559029994
478 0.3828972559029994
479 0.3828972559029994
480 0.3828972559029994
481 0.3828972559029994
482 1.5315890236119976
483 1.5315890236119976
484 1.5315890236119976
485 1.5315890236119976
486 1.5315890236119976
487 1.5315890236119976
488 1.5315890236119976
489 1.5315890236119976
491 0.7657945118059988
492 85.57753669432036
496 3.2546266751754946
497 3.2546266751754946
498 0.3828972559029994
501 0.3828972559029994
502 0.3828972559029994
503 0.3828972559029994
504 0.3828972559029994
505 0.3828972559029994
506 0.3828972559029994
507 0.3828972559029994
508 0.3828972559029994
509 0.3828972559029994
512 0.3828972559029994
513 0.3828972559029994
514 0.3828972559029994
515 0.3828972559029994
516 64.96490108487556
517 64.96490108487556
518 64.96490108487556
519 45.62858966177409
525 1.5315890236119976
528 0.574345883854499
529 0.574345883854499
530 0.574345883854499
531 0.574345883854499
532 0.574345883854499
533 0.574345883854499
534 0.574345883854499
535 0.574345883854499
536 0.574345883854499
537 0.574345883854499
538 0.574345883854499
539 0.5105296745373324
540 0.5105296745373324
541 0.5105296745373324
546 16.592214422463307
547 16.592214422463307
548 16.592214422463307
549 16.592214422463307
550 16.592214422463307
551 16.592214422463307
552 16.592214422463307
553 16.592214422463307
554 16.592214422463307
555 16.592214422463307
556 16.592214422463307
557 16.592214422463307
562 17.42182514358647
563 17.42182514358647
564 17.42182514358647
565 17.42182514358647
566 17.42182514358647
567 17.42182514358647
568 17.42182514358647
569 17.42182514358647
578 60.561582641991066
579 60.561582641991066
580 60.561582641991066
581 60.561582641991066
In [12]:
df.drop(j, axis = 1, inplace = True)
In [13]:
# Fill remaining missing values
for column in df.columns:
    df[column] = df[column].fillna(df[column].mean())
In [14]:
df.sample(5)
Out[14]:
8 9 10 11 12 13 14 15 16 17 ... 577 582 583 584 585 586 587 588 589 Pass/Fail
1414 1.4425 -0.0291 -0.0025 0.9644 196.9962 0.0 12.9242 427.2639 9.7263 0.9706 ... 20.7331 0.5035 0.0207 0.0045 4.1018 0.0122 0.0080 0.0028 65.4842 1
946 1.6057 0.0023 0.0070 0.9695 200.0485 0.0 6.9335 403.5455 9.6994 0.9733 ... 14.8682 0.5034 0.0166 0.0042 3.2954 0.0052 0.0203 0.0068 390.4146 1
204 1.4310 0.0090 0.0033 0.9640 208.1239 0.0 14.0538 423.2695 9.8416 0.9707 ... 10.3231 0.5073 0.0157 0.0040 3.0935 0.0123 0.0094 0.0026 76.4584 1
1444 1.6123 -0.0166 0.0017 0.9603 201.4521 0.0 8.7335 416.3166 9.4070 0.9675 ... 17.5152 0.5020 0.0117 0.0031 2.3308 0.0274 0.0121 0.0040 44.0961 1
971 1.4314 0.0200 0.0062 0.9691 196.2512 0.0 6.3496 402.4503 10.0109 0.9792 ... 9.3821 0.4987 0.0112 0.0027 2.2544 0.0223 0.0159 0.0053 71.0108 1

5 rows × 253 columns

In [15]:
df.isnull().sum()
Out[15]:
8            0
9            0
10           0
11           0
12           0
            ..
586          0
587          0
588          0
589          0
Pass/Fail    0
Length: 253, dtype: int64

2.B. Identify and drop the features which are having same value for all the rows.

In [16]:
def remove_duplicates(df):
    df_std = df.std()
    duplicate_features = df_std[df_std == 0].index
    print('Number of features removed with same row values t:',
          len(duplicate_features))
    df = df.drop(labels=duplicate_features, axis=1)
    return (df)
In [17]:
dup = remove_duplicates(df)
Number of features removed with same row values t: 16

2.C. Drop other features if required using relevant functional knowledge. Clearly justify the same

In [18]:
#after dropping the constant signal
row,column=df.shape
print('After dropping the constant signals the dataset contains', row, 'rows and', column, 'columns')
After dropping the constant signals the dataset contains 1567 rows and 253 columns
  • Since we dropped some of the rows ,there is no need to drop remaining rows.

2.D. Check for multi-collinearity in the data and take necessary action.

In [19]:
df.corr()
Out[19]:
8 9 10 11 12 13 14 15 16 17 ... 577 582 583 584 585 586 587 588 589 Pass/Fail
8 1.000000 -0.152133 0.058386 -0.065065 0.005457 NaN -0.085784 -0.065719 0.039571 -0.014963 ... 0.010671 -0.027611 0.017390 0.019951 0.017925 0.010433 0.022845 0.026250 -0.022770 -0.028016
9 -0.152133 1.000000 -0.064065 -0.008692 -0.065125 NaN -0.004359 0.050226 -0.027492 -0.015672 ... 0.030169 0.046209 -0.036036 -0.032583 -0.036109 0.033738 0.059301 0.060758 0.004880 0.031191
10 0.058386 -0.064065 1.000000 -0.021358 -0.052534 NaN -0.077850 -0.004057 -0.062439 -0.032645 ... 0.029452 -0.073704 0.039060 0.039115 0.039447 0.000327 0.046965 0.046048 0.008393 -0.033639
11 -0.065065 -0.008692 -0.021358 1.000000 -0.212828 NaN 0.064007 0.003347 -0.211808 0.793896 ... 0.015892 0.015412 -0.013729 -0.017338 -0.014046 0.023005 -0.014900 -0.009667 0.015281 0.032620
12 0.005457 -0.065125 -0.052534 -0.212828 1.000000 NaN -0.012805 -0.033933 0.552542 -0.104559 ... 0.031434 0.038282 0.000696 0.002081 0.000523 0.037056 -0.012258 -0.012759 -0.036720 0.005969
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
586 0.010433 0.033738 0.000327 0.023005 0.037056 NaN -0.063512 -0.011120 -0.054480 0.015993 ... -0.002684 -0.016726 0.002257 0.001605 0.002743 1.000000 0.167913 0.164238 -0.486559 -0.004156
587 0.022845 0.059301 0.046965 -0.014900 -0.012258 NaN 0.044234 0.016059 0.026357 -0.032242 ... -0.009405 -0.024473 -0.002649 -0.002498 -0.002930 0.167913 1.000000 0.974276 0.390813 -0.035391
588 0.026250 0.060758 0.046048 -0.009667 -0.012759 NaN 0.036585 0.014375 0.028829 -0.021609 ... -0.015596 -0.020705 -0.002260 -0.001957 -0.002530 0.164238 0.974276 1.000000 0.389211 -0.031167
589 -0.022770 0.004880 0.008393 0.015281 -0.036720 NaN 0.068161 0.009764 -0.013918 -0.013233 ... -0.024766 0.041486 -0.003008 -0.003295 -0.003800 -0.486559 0.390813 0.389211 1.000000 0.002653
Pass/Fail -0.028016 0.031191 -0.033639 0.032620 0.005969 NaN 0.068975 0.002884 -0.002356 0.009697 ... 0.049633 -0.047020 -0.005981 -0.005419 -0.005034 -0.004156 -0.035391 -0.031167 0.002653 1.000000

253 rows × 253 columns

In [20]:
plt.figure(figsize = (10,6))
sns.heatmap(abs(df.corr()), vmin = 0, vmax = 1)
plt.show()

2.E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions

  • Dropped the columns which are having null percentage more than 20% and imputing rest of the features with mean.
  • Dropped the features which are having same values or duplicates.
  • understanding the shape of dataframe and dropping the features if it is relevant.
  • Checking the correlation of features.

3. Data analysis & visualisation:

3.A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis

In [21]:
sns.histplot(data=df,x='Pass/Fail');
In [22]:
df['Pass/Fail'].value_counts()
Out[22]:
1    1463
0     104
Name: Pass/Fail, dtype: int64
  • The count of Pass is more.

3.B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis.

In [136]:
plt.figure(figsize = (10,6))
sns.heatmap(abs(df), vmin = 0, vmax = 1)
plt.show()
  • Lighter the cell, higher the correlation
  • There is a near 100% multi-collinearity among a few variables
In [23]:
fig = px.pie(
    df['Pass/Fail'].value_counts(),
    values='Pass/Fail',
    names=["PASS", "FAIL"],
    title="Class Distribution",
    width=500
)

fig.show()
  • Fail percentage is very low
  • Since the data is not balanced the fail percentage is very low

4. Data pre-processing:

4.A. Segregate predictors vs target attributes

In [24]:
X = df.drop(labels='Pass/Fail',axis=1)
y = df['Pass/Fail']
In [25]:
X = X.add_prefix('f')

4.B. Check for target balancing and fix it if found imbalanced

In [26]:
y.value_counts()
Out[26]:
1    1463
0     104
Name: Pass/Fail, dtype: int64
In [27]:
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
In [28]:
y.value_counts()
Out[28]:
1    1463
0    1463
Name: Pass/Fail, dtype: int64

4.C. Perform train-test split and standardise the data or vice versa if required

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
In [30]:
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

4.D. Check if the train and test data have similar statistical characteristics when compared with original data.

In [31]:
X_train.describe()
Out[31]:
f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 ... f576 f577 f582 f583 f584 f585 f586 f587 f588 f589
count 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000 2194.0 2194.000000 2194.000000 2194.000000 2194.000000 ... 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000 2194.000000
mean 1.467835 -0.001395 0.000271 0.963547 199.879413 0.0 8.733136 412.766208 9.917102 0.971261 ... 4.443181 15.537661 0.500339 0.015338 0.003839 3.066950 0.021836 0.017202 0.005493 97.809169
std 0.062518 0.013418 0.008686 0.011158 2.959540 0.0 2.543236 7.787431 2.043105 0.010433 ... 13.711164 9.413255 0.003233 0.011109 0.002384 2.303349 0.011362 0.008256 0.002659 80.467157
min 1.191000 -0.053400 -0.034900 0.655400 182.094000 0.0 2.353269 333.448600 4.469600 0.579400 ... 0.918200 4.582000 0.477800 0.006000 0.001700 1.197500 -0.016900 0.003200 0.001000 0.000000
25% 1.427840 -0.009700 -0.005011 0.957442 198.290941 0.0 6.909125 407.522250 9.599800 0.967853 ... 1.430993 11.596221 0.498269 0.011700 0.003100 2.335690 0.014501 0.011385 0.003680 49.536751
50% 1.466353 -0.001526 0.000700 0.964600 199.608973 0.0 8.743552 412.394646 9.880543 0.971800 ... 1.634951 13.778300 0.500460 0.014000 0.003614 2.798957 0.020976 0.015893 0.005025 75.980846
75% 1.510217 0.006200 0.005759 0.969850 201.680375 0.0 10.453050 417.978272 10.146196 0.975900 ... 1.888054 16.595782 0.502400 0.016852 0.004100 3.362556 0.027800 0.021025 0.006810 118.025232
max 1.653900 0.074900 0.053000 0.984800 272.045100 0.0 19.546500 448.465800 102.867700 0.984800 ... 90.423500 96.960100 0.509800 0.471400 0.103900 98.662800 0.102800 0.079900 0.028600 737.304800

8 rows × 252 columns

In [32]:
X_test.describe()
Out[32]:
f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 ... f576 f577 f582 f583 f584 f585 f586 f587 f588 f589
count 732.000000 732.000000 732.000000 732.000000 732.000000 732.0 732.000000 732.000000 732.000000 732.000000 ... 732.000000 732.000000 732.000000 732.000000 732.000000 732.000000 732.000000 732.000000 732.000000 732.000000
mean 1.463555 -0.001164 0.000500 0.964052 199.869405 0.0 8.781439 413.847780 9.899025 0.971655 ... 4.178027 15.571448 0.500407 0.015593 0.003902 3.120610 0.021634 0.017003 0.005447 98.966544
std 0.065387 0.013365 0.008323 0.008850 2.334991 0.0 2.536592 22.899211 0.465671 0.006294 ... 13.282749 10.615185 0.003357 0.017773 0.003847 3.694953 0.010895 0.008007 0.002545 81.977830
min 1.200500 -0.039300 -0.028900 0.932000 191.077700 0.0 2.249300 390.836700 8.036700 0.950200 ... 0.663600 5.437700 0.480000 0.006800 0.001800 1.366700 -0.004700 0.004500 0.001200 0.000000
25% 1.422681 -0.009200 -0.004598 0.957700 198.317104 0.0 7.021800 407.365950 9.644178 0.968077 ... 1.400631 11.711850 0.498200 0.011700 0.003120 2.338407 0.014915 0.011200 0.003500 49.702073
50% 1.461460 -0.001576 0.000900 0.964950 199.605490 0.0 8.691596 412.479742 9.907738 0.972151 ... 1.659832 13.872580 0.500500 0.014022 0.003652 2.800850 0.020950 0.015502 0.005093 76.361078
75% 1.506376 0.007057 0.006055 0.970397 201.602320 0.0 10.470400 418.153307 10.169932 0.976500 ... 1.892872 16.366790 0.502593 0.016401 0.004100 3.292175 0.027500 0.021000 0.006702 116.590069
max 1.656400 0.048400 0.039600 0.982200 208.123900 0.0 16.884400 824.927100 12.477400 0.984300 ... 88.177400 96.960100 0.509800 0.476600 0.104500 99.303200 0.102800 0.070100 0.020800 579.181700

8 rows × 252 columns

In [33]:
y_train.describe()
Out[33]:
count    2194.000000
mean        0.490884
std         0.500031
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: Pass/Fail, dtype: float64
In [34]:
y_test.describe()
Out[34]:
count    732.000000
mean       0.527322
std        0.499594
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: Pass/Fail, dtype: float64

5. Model training, testing and tuning:

5.A. Use any Supervised Learning technique to train a model.

In [35]:
logit = LogisticRegression()
logit.fit(X_train, y_train)
logit_pred = logit.predict(X_test)

print('Accuracy on Training data:',logit.score(X_train, y_train) )
print('Accuracy on Test data:',logit.score(X_test, y_test) )
Accuracy on Training data: 0.6754785779398359
Accuracy on Test data: 0.6844262295081968
In [36]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors = 3)
# fitting the model
knn.fit(X_train, y_train)
# predict the response
y_pred = knn.predict(X_test)
# evaluate accuracy
print(accuracy_score(y_test, y_pred))
0.8524590163934426

5.B. Use cross validation techniques

In [37]:
num_folds = 50
seed = 7

kfold = KFold(n_splits=num_folds, random_state=None)
model = LogisticRegression()
results = cross_val_score(model,X,y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[0.44067797 0.44067797 0.40677966 0.45762712 0.71186441 0.6440678
 0.49152542 0.6440678  0.79661017 0.6440678  0.76271186 0.6779661
 0.79661017 0.45762712 0.74576271 0.69491525 0.72881356 0.6779661
 0.59322034 0.54237288 0.66101695 0.72881356 0.62711864 0.79661017
 0.76271186 0.6779661  0.74137931 0.5862069  0.65517241 0.70689655
 0.56896552 0.63793103 0.60344828 0.70689655 0.68965517 0.55172414
 0.63793103 0.56896552 0.68965517 0.60344828 0.65517241 0.75862069
 0.72413793 0.67241379 0.67241379 0.65517241 0.5862069  0.74137931
 0.55172414 0.74137931]
Accuracy: 64.634% (9.785%)
In [38]:
# loocv to automatically evaluate the performance of a random forest classifier
from numpy import mean
from numpy import std
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
# create dataset
X, y = make_blobs(n_samples=100, random_state=1)
# create loocv procedure
cv = LeaveOneOut()
# create model
model =LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Accuracy: 64.634% (9.785%)
In [39]:
from sklearn.model_selection import cross_val_score

dt = DecisionTreeClassifier()
score1 = cross_val_score(dt, X, y, cv = 10).mean()
print(f'Cross validation score of Decision tree = {score1}')
Cross validation score of Decision tree = 0.9800000000000001
In [40]:
#Random Forest rf
rf = RandomForestClassifier() 
score2 = cross_val_score(rf, X, y, cv = 10).mean()
print(f'Cross validation score of Random forest = {score2}')
Cross validation score of Random forest = 0.99

5.C.Apply hyper-parameter tuning techniques to get the best accuracy

In [41]:
#Logistic Regression with only feedback columns
from sklearn.linear_model import LogisticRegression #importing logistic regression

lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)  # Predictions from logistic regression
score1 = lr.score(X_test, y_test)
score1
Out[41]:
0.6844262295081968
In [42]:
#Logistic Regression with only feedback columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1, stratify = y)

lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)

score2 = lr.score(X_test, y_test)

print(f'Number of features used = {len(X_train)}')
print(f'Accuracy = {score2}')
Number of features used = 75
Accuracy = 1.0
In [43]:
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)

score3 = dt.score(X_test, y_test)
pred = dt.predict(X_test)

print(f"Decision tree acccuracy score: {score3}")
Decision tree acccuracy score: 1.0
In [44]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)

score4 = rf.score(X_test, y_test)

print(f'Random Forest accuracy score = {score4}')
Random Forest accuracy score = 1.0
In [45]:
from sklearn.model_selection import cross_val_score
#For Decision Tree dt
score5 = cross_val_score(dt, X, y, cv = 10).mean()
print(f'Cross validation score of Decision tree = {score5}')
Cross validation score of Decision tree = 0.9800000000000001
In [46]:
#Random Forest rf
score6 = cross_val_score(rf, X, y, cv = 10).mean()
print(f'Cross validation score of Random forest = {score6}')
Cross validation score of Random forest = 1.0
In [47]:
from sklearn.model_selection import GridSearchCV

parameters = {'bootstrap': [True],
 'max_depth': [10, 20, 30, 40, 50],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4, 8],
 'n_estimators': [100]}


clf = GridSearchCV(RandomForestClassifier(), parameters, cv = 5, verbose = 2, n_jobs= 4)
clf.fit(X, y)

clf.best_params_
Fitting 5 folds for each of 40 candidates, totalling 200 fits
Out[47]:
{'bootstrap': True,
 'max_depth': 20,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'n_estimators': 100}
In [48]:
rf = RandomForestClassifier(bootstrap= True,
 max_depth= 30,
 max_features= 'sqrt',
 min_samples_leaf= 1,
 n_estimators= 100)
rf.fit(X_train, y_train)
score7 = cross_val_score(rf, X_train, y_train, cv = 5).mean()
score7
Out[48]:
0.9866666666666667
In [49]:
data = {'Technique' : ['Logistic Regression', "LR", 'Decision tree',
                       'Random forest', 'DT CV','RF CV','Tuned RF CV'],
       'Score' : [score1, score2, score3, score4, score5, score6, score7] }

result = pd.DataFrame(data)
In [50]:
result
Out[50]:
Technique Score
0 Logistic Regression 0.684426
1 LR 1.000000
2 Decision tree 1.000000
3 Random forest 1.000000
4 DT CV 0.980000
5 RF CV 1.000000
6 Tuned RF CV 0.986667

5.D. Use any other technique/method which can enhance the model performance.

In [58]:
X = df.drop(labels='Pass/Fail',axis=1)
y = df['Pass/Fail']
In [59]:
X = X.add_prefix('f')
In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
In [61]:
from sklearn.decomposition import PCA
pca = PCA(10)# Initialize PCA object
#pca = PCA(.95)
pca.fit(X_train)
Out[61]:
PCA(n_components=10)
In [62]:
X_train_pca = pca.transform(X_train)  # PCs for the train data
X_test_pca = pca.transform(X_test)    # PCs for the test data
X_train_pca.shape, X_test_pca.shape
Out[62]:
((1175, 10), (392, 10))
In [63]:
pca.explained_variance_
Out[63]:
array([54061643.68172596, 21394156.54857915,  8145283.99714214,
        2085792.20943108,  1338221.83028488,   245275.6629994 ,
         221478.17779838,   191983.20758722,   114605.770092  ,
         106692.98191255])
In [64]:
lr = LogisticRegression()
lr.fit(X_train_pca, y_train)
score9 = lr.score(X_test_pca, y_test)
score9
Out[64]:
0.9464285714285714
In [65]:
dt = DecisionTreeClassifier()
dt.fit(X_train_pca, y_train)
score10 = dt.score(X_test_pca, y_test)
score10
Out[65]:
0.8443877551020408
In [66]:
rf = RandomForestClassifier(bootstrap = True, max_depth = 30, max_features ='auto', min_samples_leaf = 1, n_estimators = 100)
rf.fit(X_train_pca, y_train)
score11 = rf.score(X_test_pca, y_test)
score11
Out[66]:
0.9489795918367347
In [69]:
rf = RandomForestClassifier(bootstrap = True, max_depth = 30, max_features ='auto', min_samples_leaf = 1, n_estimators = 100)
rf.fit(X_train_pca, y_train)
score11 = rf.score(X_test_pca, y_test)
score11
Out[69]:
0.9438775510204082
In [70]:
lr = LogisticRegression()
score12 = cross_val_score(lr,X_train_pca, y_train , cv = 5).mean()

dt = DecisionTreeClassifier()
score13 = cross_val_score(dt, X_train_pca, y_train, cv = 5).mean()

rf = RandomForestClassifier(bootstrap = True, max_depth = 10, max_features ='sqrt', min_samples_leaf = 1, n_estimators = 100)
score14 = cross_val_score(rf, X_train_pca, y_train, cv = 5).mean()
In [71]:
result = pd.DataFrame({'Algorithm' : ['Logistic Regression', 'Deision Tree', 'Random Forest'],
                      'Accuracy_score': [score9, score10, score11],
                      'Cross_val_score' : [score12, score13, score14]})
result
Out[71]:
Algorithm Accuracy_score Cross_val_score
0 Logistic Regression 0.946429 0.929362
1 Deision Tree 0.844388 0.860426
2 Random Forest 0.943878 0.926809

5.E. Display and explain the classification report in detail.

In [81]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=50)  

#instantiate the model
logistic_regression = LogisticRegression()

#fit the model using the training data
logistic_regression.fit(X_train,y_train)

#use model to make predictions on test data
y_pred = logistic_regression.predict(X_test)
In [82]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      1.00      0.97       367

    accuracy                           0.94       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.94      0.91       392

In [83]:
#instantiate the model
dt =DecisionTreeClassifier()

#fit the model using the training data
dt.fit(X_train,y_train)

#use model to make predictions on test data
y_pred = logistic_regression.predict(X_test)
In [84]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      1.00      0.97       367

    accuracy                           0.94       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.94      0.91       392

In [85]:
#instantiate the model
rf = RandomForestClassifier(bootstrap = True, max_depth = 10, max_features ='sqrt', min_samples_leaf = 1, n_estimators = 100)

#fit the model using the training data
rf.fit(X_train,y_train)

#use model to make predictions on test data
y_pred = logistic_regression.predict(X_test)
In [86]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      1.00      0.97       367

    accuracy                           0.94       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.94      0.91       392

In [102]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');

5.F. Apply the above steps for all possible models that you have learnt so far

Bagging :

In [105]:
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(base_estimator=dt, n_estimators=50,random_state=1)
bgcl = bgcl.fit(X_train, y_train)
In [107]:
from sklearn.metrics import confusion_matrix
y_predict = bgcl.predict(X_test)
print(bgcl.score(X_test , y_test))
cm=confusion_matrix(y_test, y_predict)
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
0.9285714285714286

AdaBoosting :

In [108]:
from sklearn.ensemble import AdaBoostClassifier
ada_bcl = AdaBoostClassifier(n_estimators=10, random_state=1)
ada_bcl = ada_bcl.fit(X_train, y_train)
In [110]:
y_predict = ada_bcl.predict(X_test)
print(ada_bcl.score(X_test , y_test))
cm=confusion_matrix(y_test, y_predict)
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
0.9311224489795918

GradientBoost :

In [111]:
from sklearn.ensemble import GradientBoostingClassifier
gra_bcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gra_bcl = gra_bcl.fit(X_train, y_train)
In [112]:
y_predict = gra_bcl.predict(X_test)
print(gra_bcl.score(X_test, y_test))
cm=confusion_matrix(y_test, y_predict)
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
0.9311224489795918
In [141]:
# first, initialize the classificators
tree= DecisionTreeClassifier(random_state=24) # using the random state for reproducibility
forest= RandomForestClassifier(random_state=24)
knn= KNeighborsClassifier()
svm= SVC(random_state=24)
xboost= XGBClassifier(random_state=24)

# now, create a list with the objects 
models= [tree, forest, knn, svm, xboost]

for model in models:
    model.fit(X_train, y_train) # fit the model
    y_pred= model.predict(X_test) # then predict on the test set
    accuracy= accuracy_score(y_test, y_pred) # this gives us how often the algorithm predicted correctly
    clf_report= classification_report(y_test, y_pred) # with the report, we have a bigger picture, with precision and recall for each class
    print(f"The accuracy of model {type(model).__name__} is {accuracy:.2f}")
    print(clf_report)
    print("\n")
The accuracy of model DecisionTreeClassifier is 0.85
              precision    recall  f1-score   support

           0       0.03      0.04      0.03        25
           1       0.93      0.91      0.92       367

    accuracy                           0.85       392
   macro avg       0.48      0.48      0.48       392
weighted avg       0.88      0.85      0.86       392



The accuracy of model RandomForestClassifier is 0.93
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      0.99      0.96       367

    accuracy                           0.93       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.93      0.90       392



The accuracy of model KNeighborsClassifier is 0.93
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      0.99      0.96       367

    accuracy                           0.93       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.93      0.90       392



The accuracy of model SVC is 0.94
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      1.00      0.97       367

    accuracy                           0.94       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.94      0.91       392



The accuracy of model XGBClassifier is 0.94
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        25
           1       0.94      1.00      0.97       367

    accuracy                           0.94       392
   macro avg       0.47      0.50      0.48       392
weighted avg       0.88      0.94      0.91       392



6. Post Training and Conclusion:

6.A. Display and compare all the models designed with their train and test accuracies

In [159]:
report=pd.DataFrame({'Algorithms':[logistic_regression,dt,rf,bgcl,ada_bcl,
                    gra_bcl,knn,SVC,XGBClassifier],
       'Accuracy':[0.94,0.94,0.94,0.92,0.93,0.93,0.93,0.94,0.94]
      })
In [160]:
report
Out[160]:
Algorithms Accuracy
0 LogisticRegression() 0.94
1 DecisionTreeClassifier() 0.94
2 (DecisionTreeClassifier(max_depth=10, max_feat... 0.94
3 (DecisionTreeClassifier(random_state=102886208... 0.92
4 (DecisionTreeClassifier(max_depth=1, random_st... 0.93
5 ([DecisionTreeRegressor(criterion='friedman_ms... 0.93
6 KNeighborsClassifier() 0.93
7 <class 'sklearn.svm._classes.SVC'> 0.94
8 <class 'xgboost.sklearn.XGBClassifier'> 0.94

6.B. Select the final best trained model along with your detailed comments for selecting this model.

  • The model above 0.94 will be the best models.

6.C. Pickle the selected model for future use.

In [161]:
import pickle
In [163]:
regressor = LogisticRegression()
#Fitting model with training data
regressor.fit(X, y)
Out[163]:
LogisticRegression()
In [164]:
pickle.dump(regressor, open('model.pkl','wb'))

6.D. Write your conclusion on the results

  • The models of logisticregression,Decisionclassifier,SVC,XGBClassifier are the models which gives us the good accuracy when balanced the data when compared to other models.